{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "df9622ec-0360-4a43-a745-f52bded50eb5",
   "metadata": {},
   "source": [
    "# Hands-On Data Analysis with DuckDB\n",
    "\n",
    "## Technical requirements\n",
    "\n",
    "In order to run the examples in this notebook, you'll need to install the Python dependencies for this project. You can do this by running the following command in your terminal when in the root directory of the project. Note that ideally this should be using a Python virtual environment for this project.\n",
    "\n",
    "    pip install -r requirements.txt\n",
    "\n",
    "For complete instructions on how to set up your environment for working through the examples, please consult the *Technical requirements* section of the chapter *Setting up the DuckDB Python Client*."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "237acc22-3f65-4bd2-9152-f11d04b44824",
   "metadata": {},
   "source": [
    "## Preparing the pedestrian traffic dataset for analysis "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9ba71cbc-5712-49fe-920b-096815e95862",
   "metadata": {},
   "source": [
    "### Establishing the data processing steps"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cad9f79a-ce83-43b0-8684-8948db2c5da9",
   "metadata": {},
   "outputs": [],
   "source": [
    "import duckdb\n",
    "\n",
    "records = duckdb.read_csv(\"pedestrian_records_2009-2022.csv\") \n",
    "\n",
    "records.show(max_width=200) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "73e7c521-39e5-4cfa-ba60-5077c8211fc7",
   "metadata": {},
   "outputs": [],
   "source": [
    "records = duckdb.read_csv( \n",
    "    \"pedestrian_records_2009-2022.csv\", \n",
    "    dtype={\"Date_Time\": \"TIMESTAMP\"}, \n",
    "    timestamp_format=\"%B %d, %Y %H:%M:%S %p\", \n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b4f88bde-bdc9-4eb8-a894-fd401b6dac2c",
   "metadata": {},
   "outputs": [],
   "source": [
    "records.limit(5).show(max_width=200)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20f566b2-8045-40ba-bb9b-ab98c876923b",
   "metadata": {},
   "outputs": [],
   "source": [
    "transformed = records.select(\"* EXCLUDE ID\").sort(\"Date_Time\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f309ce93-5078-4abf-b3a3-df3a3eb0fdbb",
   "metadata": {},
   "outputs": [],
   "source": [
    "transformed.limit(5).show(max_width=200)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ba7eef65-d1ce-427d-85fb-175bae8c1e78",
   "metadata": {},
   "source": [
    "### Loading the prepared dataset into a database table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "90b5c7d5-721b-45e9-b7cf-09b51c861b7a",
   "metadata": {},
   "outputs": [],
   "source": [
    "with duckdb.connect(\"pedestrian.duckdb\") as conn:\n",
    "    result = (\n",
    "        conn.read_csv(\n",
    "            \"pedestrian_records_2009-2022.csv\",\n",
    "            dtype={\"Date_Time\": \"TIMESTAMP\"},\n",
    "            timestamp_format=\"%B %d, %Y %H:%M:%S %p\",\n",
    "        )\n",
    "        .select(\"* EXCLUDE ID\")\n",
    "        .sort(\"Date_Time\")\n",
    "    )\n",
    "    result.to_table(\"pedestrian_counts\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b51f95c9-6391-4e4e-a7b9-b37edd55976c",
   "metadata": {},
   "source": [
    "## Effective data analysis using Jupyter Notebooks "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a55e80ef-51b0-4c5f-a312-a2786541fa6e",
   "metadata": {},
   "source": [
    "### Convenient SQL queries with JupySQL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14aeab32-4b3e-40d9-bbd9-6a7d2163369c",
   "metadata": {},
   "outputs": [],
   "source": [
    "conn = duckdb.connect(\"pedestrian.duckdb\")\n",
    "\n",
    "conn.sql(\n",
    "    \"\"\"\n",
    "    SELECT sum(Hourly_Counts) AS Total_counts\n",
    "    FROM pedestrian_counts\n",
    "    WHERE Year = 2022 AND Sensor_Name = 'Melbourne Central'\n",
    "    \"\"\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7cb2db40-9bc8-40a2-9d89-45228224261c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# configure JupySQL to use the default DuckDB database\n",
    "%load_ext sql \n",
    "conn = duckdb.connect() \n",
    "%sql conn --alias duckdb "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea9c6805-0921-4277-ad1c-e0ac285e4bbd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# for this exercise, we'll configure JupySQL to use a connection to the on-disk database\n",
    "conn = duckdb.connect(\"pedestrian.duckdb\")\n",
    "%sql conn --alias pedestrian.duckdb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8801c8ab-c0ee-4179-8d7c-9713ce23a1d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# configure JupySQL to return Pandas dataframes by default\n",
    "%config SqlMagic.autopandas = True "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f4fef492-466f-4ea6-a971-5ec9bb3592e6",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql \n",
    "SELECT sum(Hourly_Counts) AS Total_Counts\n",
    "FROM pedestrian_counts \n",
    "WHERE Year = 2022 AND Sensor_Name = 'Melbourne Central' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4291e4b-ae45-4336-92f9-c4141c7fa796",
   "metadata": {},
   "outputs": [],
   "source": [
    "type(_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b43b349b-e3c5-4325-815e-2726127a1779",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql sensors_2022_df << \n",
    "SELECT Sensor_Name,\n",
    "    sum(Hourly_Counts)::BIGINT AS Total_Counts \n",
    "FROM pedestrian_counts \n",
    "WHERE Year = 2022 \n",
    "GROUP BY Sensor_Name \n",
    "ORDER BY Total_Counts DESC "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18e78ff5-319e-46c5-88c8-52d9f7ab9c11",
   "metadata": {},
   "outputs": [],
   "source": [
    "sensors_2022_df.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "69872690-2016-439c-8502-a60e945fa789",
   "metadata": {},
   "outputs": [],
   "source": [
    "sensors_2022_df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e4c7167-9725-4224-9403-e19e98b761ce",
   "metadata": {},
   "source": [
    "### Interactive visualisations with Plotly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5310100-c75e-4f22-81b5-9805d261e8f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import plotly.express as px"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fd670f36-9be3-4c9a-b640-e6359d5dc769",
   "metadata": {},
   "outputs": [],
   "source": [
    "figure = px.bar( \n",
    "    sensors_2022_df.head(10),\n",
    "    x=\"Sensor_Name\",\n",
    "    y=\"Total_Counts\",\n",
    "    height=700,\n",
    "    title=\"Top 10 sensors by traffic for 2022\",\n",
    ")\n",
    "\n",
    "figure"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7c0a13b5-4d88-435f-b571-55bd060fd1d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql sensor_years_df << \n",
    "SELECT Year, COUNT(DISTINCT Sensor_Name) AS Total_Sensors \n",
    "FROM pedestrian_counts \n",
    "GROUP BY Year \n",
    "ORDER BY Year "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb16d3fd-2ef3-4fe3-847b-f8adc687474d",
   "metadata": {},
   "outputs": [],
   "source": [
    "sensor_years_df.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b403df36-9142-44c4-a0da-bde734990cf5",
   "metadata": {},
   "outputs": [],
   "source": [
    "figure = px.line( \n",
    "    sensor_years_df, \n",
    "    x=\"Year\",\n",
    "    y=\"Total_Sensors\",\n",
    "    markers=True,\n",
    "    height=500,\n",
    "    title=\"Total number of active sensors by year\"\n",
    ")\n",
    "\n",
    "figure"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "63bd9d4b-3daf-4652-9c65-56e8a35c409e",
   "metadata": {},
   "outputs": [],
   "source": [
    "figure.update_layout(xaxis={\"dtick\": 1}, title={\"x\": 0.5})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "02440be4-1511-494f-b195-4d6bd776d746",
   "metadata": {},
   "outputs": [],
   "source": [
    "figure.update_layout(xaxis_dtick=1, title_x=0.5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2a9820f-be33-46de-85bd-7b8e343483f4",
   "metadata": {},
   "source": [
    "## Analysing pedestrian traffic through Melbourne CBD"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "906721b2-5877-4c0b-a54f-99493e536914",
   "metadata": {},
   "source": [
    "### Visualizing total pedestrian counts over time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aed275c6-d8fa-48fa-8324-b9f6c6d8ca85",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql year_counts_df <<\n",
    "SELECT Year, sum(Hourly_Counts)::BIGINT AS Total_Counts\n",
    "FROM pedestrian_counts\n",
    "GROUP BY Year\n",
    "ORDER BY Year"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae217367-393a-4e98-ae54-115f6201bc11",
   "metadata": {},
   "outputs": [],
   "source": [
    "px.line(\n",
    "    year_counts_df,\n",
    "    x=\"Year\",\n",
    "    y=\"Total_Counts\",\n",
    "    markers=True,\n",
    "    height=500,\n",
    "    title=\"Yearly traffic across all sensors\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "84371482-c817-48aa-b2fa-43cf78e3334e",
   "metadata": {},
   "outputs": [],
   "source": [
    "%sql SELECT count(DISTINCT Year) FROM pedestrian_counts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1f48f9b-0681-4fd8-a424-70e13daa4242",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql\n",
    "CREATE OR REPLACE TABLE common_sensors AS\n",
    "SELECT Sensor_Name\n",
    "FROM pedestrian_counts\n",
    "GROUP BY Sensor_Name\n",
    "HAVING COUNT(DISTINCT Year) = 14;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1326cb92-c8d0-4afb-becd-c91aceb25231",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql year_counts_filtered_df <<\n",
    "SELECT Year, sum(Hourly_Counts)::BIGINT AS Total_Counts\n",
    "FROM pedestrian_counts\n",
    "WHERE Sensor_Name IN (FROM common_sensors)\n",
    "GROUP BY Year\n",
    "ORDER BY Year"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "59efa9bd-43d4-46f5-90eb-6683c74e98d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "px.line(\n",
    "    year_counts_filtered_df,\n",
    "    x=\"Year\",\n",
    "    y=\"Total_Counts\",\n",
    "    markers=True,\n",
    "    height=500,\n",
    "    title=\"Yearly traffic for sensors active all years\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "958adac1-e7dc-4d1c-9295-cd8e25495ab1",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql year_month_counts_df << \n",
    "SELECT\n",
    "    Year,\n",
    "    Month,\n",
    "    month(Date_Time) AS Month_Num,\n",
    "    sum(Hourly_Counts)::BIGINT AS Total_Counts\n",
    "FROM pedestrian_counts\n",
    "WHERE Year IN (2019, 2020, 2021)\n",
    "    AND Sensor_Name in (FROM common_sensors)\n",
    "GROUP BY Year, Month, Month_Num\n",
    "ORDER BY Year, Month_Num"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b172ed35-2e51-4f14-b7c2-04795a17df3f",
   "metadata": {},
   "outputs": [],
   "source": [
    "year_month_counts_df.head(15)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ee0343b9-a46d-4b5c-b86b-232e69cd8aa1",
   "metadata": {},
   "outputs": [],
   "source": [
    "px.line(\n",
    "    year_month_counts_df,\n",
    "    x=\"Month\",\n",
    "    y=\"Total_Counts\",\n",
    "    color=\"Year\",\n",
    "    symbol=\"Year\",\n",
    "    symbol_sequence=[\"square\", \"diamond\", \"circle\"],\n",
    "    markers=True,\n",
    "    height=500,\n",
    "    title=\"Monthly traffic for sensors active 2019-2021\",\n",
    ").update_traces(marker_size=8)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c6dea600-ef87-4593-ba1d-1a47842cc770",
   "metadata": {},
   "source": [
    "### Time series plots of sensors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1c0b0bc-63c8-42f6-af7b-f8979e3a1131",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql sensor_2020_df << \n",
    "SELECT Hourly_Counts, Date_Time \n",
    "FROM pedestrian_counts \n",
    "WHERE Sensor_Name = 'Flinders La-Swanston St (West)'\n",
    "    AND Year = 2020"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "644d370b-4310-4926-92e4-4b98296abd24",
   "metadata": {},
   "outputs": [],
   "source": [
    "config = {\n",
    "  'toImageButtonOptions': {\n",
    "    'format': 'png',\n",
    "    'filename': 'custom_image',\n",
    "    'height': 500,\n",
    "    'width': 1200,\n",
    "    'scale':5\n",
    "  }\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eec98ae7-06cb-48f7-9517-02eab161cda8",
   "metadata": {},
   "outputs": [],
   "source": [
    "px.line(\n",
    "    sensor_2020_df,\n",
    "    y=\"Hourly_Counts\",\n",
    "    x=\"Date_Time\",\n",
    "    height=500,\n",
    "    title=\"Hourly traffic for Flinders La-Swanston St (West)\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0a27e206-b7a0-4d93-b8ae-1b6e2889053f",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql multi_sensor_df <<\n",
    "SELECT Sensor_Name, Hourly_Counts, Date_Time\n",
    "FROM pedestrian_counts\n",
    "WHERE Sensor_Name IN (\n",
    "        'Flinders St-Spark La',\n",
    "        'Bourke Street Mall (North)',\n",
    "        'Southern Cross Station'\n",
    "    )\n",
    "    AND Year = 2019\n",
    "    AND Month = 'September'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e0df35d3-5105-470b-8aaf-e096d4a6acda",
   "metadata": {},
   "outputs": [],
   "source": [
    "px.line(\n",
    "    multi_sensor_df,\n",
    "    y=\"Hourly_Counts\",\n",
    "    x=\"Date_Time\",\n",
    "    facet_col=\"Sensor_Name\",\n",
    "    facet_col_wrap=1,\n",
    "    title=\"Hourly pedestrian traffic for December 2019\",\n",
    "    height=800,\n",
    ").update_layout(yaxis_fixedrange=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67e12180-07ea-4153-b1bd-ec0368b5f456",
   "metadata": {},
   "source": [
    "## Visualising the distribution of hourly pedestrian traffic"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f95683fe-a402-414a-a204-c93bf11086ee",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sql bourke_daily_df << \n",
    "SELECT\n",
    "    Year,\n",
    "    Date_Time::DATE AS Date,\n",
    "    sum(Hourly_Counts)::BIGINT AS Daily_Counts,\n",
    "FROM pedestrian_counts\n",
    "WHERE Sensor_Name = 'Bourke Street Mall (North)'\n",
    "    AND Year IN (2019, 2020, 2021)\n",
    "GROUP BY Year, Date"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "98306cd4-df4d-4ddc-a9c9-725b7b42f3d9",
   "metadata": {},
   "outputs": [],
   "source": [
    "bourke_daily_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c562b2f7-4f95-4504-a18e-f5a613139182",
   "metadata": {},
   "outputs": [],
   "source": [
    "px.box(\n",
    "    bourke_daily_df,\n",
    "    x=\"Year\",\n",
    "    y=\"Daily_Counts\",\n",
    "    points=\"all\",\n",
    "    height=600,\n",
    "    title=\"Distributions of daily traffic for a sensor\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "09111b8a-345f-4eb0-86b8-7bfea891b0c6",
   "metadata": {},
   "source": [
    "## Summary"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
